DIFFERENTIAL FILES: THEIR APPLICATION TO THE
MAINTENANCE OF LARGE DATABASES


Dennis G. Severance
University of Minnesota

Guy M. Lohman
Cornell University


Abstract

The representation of a collection of data in terms of its differences
from some pre-established point of reference is a basic compaction technique
which finds wide applicability. This paper describes a differential database
representation which is shown to be an efficient method for storing
large and volatile databases. The technique confines database modifications
to a relatively small area of physical storage and as a result offers two
significant operational advantages. First, because the "reference point"
for the database is inherently static, it can be simply and efficiently
stored. Moreover, since all modifications to the database are physically
localized, the process of backup and the process of recovery are relatively
fast and inexpensive.
1. INTRODUCTION

The representation of data in terms of differences from a pre-established
point of reference is a data compaction technique with wide applicability. The
differential coding of satellite data, the pre-execution merge of an object
module with an associated "patch" deck, the invocation of recursive function calls,
the modification of a DO-loop index, the concept of base-addressing, and even the
distribution of a revised system manual in the form of errata sheets are all
applications of a common principle: differential encoding.

This paper describes a differential database representation which is shown
to be an efficient method for storing a large and changing database. The power
of the representation is derived from the fact that all database modifications
are localized into a relatively small storage area, called a differential file.
By consolidating changes in this manner, it is possible to reduce backup costs,
speed the process of database recovery, and even minimize the probability of a
serious data loss. In addition, the technique can provide increased data
availability while reducing both storage and retrieval costs.

The paper consists of two major sections: Section 2 motivates, describes,
and analyzes the concept of a differential file; Section 3 presents ten specific
advantages of the idea. A summary and an outline of future research are given
in conclusion.
2. DIFFERENTIAL FILES

2.1 An analogy. A differential file for a database is analogous to an
errata list for a book. Rather than print a new edition of a book each time
a change in text is desired, a publisher will identify corrections by page and
line number, and collect them into an errata list which is distributed with
each book. This procedure significantly reduces publication costs. To reference
the corrected version of the book, however, readers must consult the errata list
before any reading of the main text. An increase in access time is thus traded
for a decrease in maintenance cost. If text changes continue, the errata
list will grow to a sufficient length that reorganization costs are justified.
All changes would then be incorporated into the book, forming a new physical
edition.

Updating a large database poses a similar problem. As with a book, it is
generally simplest and least expensive to accumulate changes over a period of
time and to post them en masse when creating a new database edition (generation).
It is most expensive, as measured in terms of storage costs, maintenance time,
and overall system complexity, to directly modify the database with each update
transaction. As a compromise, a differential file can be used like an errata
list to collect and identify pending record changes. Consulting the differential
file as a first step in data retrieval effectively yields an up-to-date
database. At a cost of increased access time, system overhead may be reduced.
When the differential file grows sufficiently large, a reorganization would
incorporate all changes into a new generation of the database, and the now
empty differential file would begin accumulating changes anew.
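
The errata-list analogy can be made concrete with a minimal sketch. The
class and method names below are hypothetical, Python dictionaries stand in
for the main file and the differential file, and record deletion is not
modeled; this illustrates the idea and does not reproduce any system
described in this paper.

```python
class DifferentialDatabase:
    """Read-only main file plus a small differential file of pending changes."""

    def __init__(self, main_records):
        self.main = dict(main_records)   # current edition; never updated in place
        self.differential = {}           # record id -> most recent record image
        self.generation = 1

    def update(self, record_id, record):
        # Every change, including an addition, is posted to the differential file.
        self.differential[record_id] = record

    def retrieve(self, record_id):
        # Consult the "errata list" first, then fall back to the main file.
        if record_id in self.differential:
            return self.differential[record_id]
        return self.main.get(record_id)

    def reorganize(self):
        # Fold all pending changes into a new generation and start afresh.
        merged = dict(self.main)
        merged.update(self.differential)
        self.main = merged
        self.differential = {}
        self.generation += 1
```

Between reorganizations only the differential dictionary changes, so it is
the only structure whose loss would require changes to be reapplied.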

2.2 Earlier proposals. The concept of a differential file has been
rediscovered many times. The authors have benefited from numerous discussions with
colleagues who have seen the idea used in one form or another to solve
particular updating problems. Three documented systems will be described. Turnburke
[16] outlines a differential structure for tape systems which is designed to
avoid the writing of unchanged data records while sequentially processing batched
updates. A data file is composed from two ordered subfiles: a large collection
of read-only records is stored on one tape, while a smaller collection of modified
records is maintained on a separate "change-tape". To update the data file, both
tapes are merged with a transaction file and a new change-tape is output.
Unchanged records from the read-only tape are never written. Turnburke recommends
data file reorganization once one half of all records have been modified.

Roycroft [13] suggests a direct access file organization which also makes
use of a differential file concept to process file changes. The system addresses
records via a unique identifier, and every data reference passes through a
database index which points to all records. Once created, the main data file is
never modified. New database records are accessed through the index but are
stored in a separate overflow area. All record modifications are treated as
record additions. A new copy of the record is created and the index is updated to
point into the overflow area. The old record is not destroyed, but rather
maintained as a before-image and pointed to by the new record. Roycroft's primary
motivation for this technique is to permit a data record to grow in size as a
result of an update without disturbing the positioning of neighboring records.
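
The following is a minimal sketch of an index-plus-overflow organization of
the kind just described. It is an illustration under stated assumptions and
not Roycroft's implementation: the names `IndexedFile`, `OverflowRecord`,
`store`, `fetch`, and `history` are hypothetical, and Python lists stand in
for the main data file and the overflow area.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class OverflowRecord:
    data: str
    before_image: Optional[Tuple[str, int]]  # location of the superseded version


class IndexedFile:
    def __init__(self, main_records):
        # The main data file is written once and never modified afterwards.
        self.main = list(main_records.values())
        self.overflow = []
        # Every data reference passes through the index: id -> (area, slot).
        self.index = {rid: ("main", i) for i, rid in enumerate(main_records)}

    def _read(self, location):
        area, slot = location
        return self.main[slot] if area == "main" else self.overflow[slot]

    def store(self, record_id, data):
        # Additions and modifications alike append a new copy to the overflow area.
        previous = self.index.get(record_id)
        self.overflow.append(OverflowRecord(data, previous))
        self.index[record_id] = ("overflow", len(self.overflow) - 1)

    def fetch(self, record_id):
        record = self._read(self.index[record_id])
        return record.data if isinstance(record, OverflowRecord) else record

    def history(self, record_id):
        # Follow before-image pointers from the newest version back to the oldest.
        versions, location = [], self.index.get(record_id)
        while location is not None:
            record = self._read(location)
            if isinstance(record, OverflowRecord):
                versions.append(record.data)
                location = record.before_image
            else:
                versions.append(record)
                location = None
        return versions
```

Because a modified record is written as a new copy, it may be longer than
the version it replaces without displacing any neighboring record.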

A system with a similar structure is described by Rappaport [12]. It was
developed to facilitate database recovery after an electrical power failure. Again
all database records are accessed through a system index, and all modifications
are physically separated into a file of changes called a MODFILE. Each changed
record points back to its before-image. In the event of a power loss, information
in a transaction log is used in conjunction with the MODFILE to undo partially
completed update transactions.

2.3 A Generalization. Whenever a record is updated in either the Roycroft
or Rappaport system, a record search mechanism (initially associated with only
the main data file) is modified to address a new record copy which is stored in
what is essentially a different file. As depicted by Figure 1a, the current
version of any identified record, whether in the main file or the differential file,
is accessed via a common search mechanism, the system index.

A generalization of this record accessing strategy is shown in Figure 1b.
Here, given the identifier for any database record, the differential file is
always searched first for that record; in the event that the record is not found,
it is then retrieved from the main data file. Implicit in this diagram is the
fact that each file may utilize a separate search mechanism. The main file index
is then static and can be quickly recovered from a backup copy in the event of a
loss. Index volatility is shifted to a smaller, and therefore more quickly
recoverable, differential file index.

To isolate the main file and its search mechanism from change, a delay in
the form of a differential file search is paid for every record retrieval. If
the two data files and their search mechanisms can be assigned to different
devices and accessed via separate channels, then both file searches may proceed
in parallel and system users will not perceive the increased delay. When such
overlap is impossible, one can expect the average time of a data record retrieval
(assuming a judicious selection of the differential file search strategy [15]) to
increase by the amount of time required for a random access to secondary memory.
This additional access time may be comparatively large and can seriously degrade
system performance.
2.4 Avoiding a double-access. For operating environments in which a
significant increase in retrieval time is intolerable, Figure 1c suggests a modified
search strategy which uses a pre-search filtering algorithm to reduce the number
of unnecessary searches of the differential file. A filtering scheme, devised
by Bloom [1] to detect the occurrence of rare events, can be used to nearly
eliminate unsuccessful searches. The Bloom technique associates the differential
file with a main memory bit vector B of length M, and some number X of hashing
functions which map record identifiers into bit addresses. When the differential
file is initially empty, all bits in B are set to zero. Whenever a record
is stored in the differential file, each transformation is applied to the record
identifier and each of the X bits addressed is set to 1.

Retrieval of a database record now proceeds as follows. The identifier
of a record to be retrieved is mapped through each transformation and the logical
AND operator is applied to the X bits which are addressed. The resulting bit
value is either 0 or 1. The value 0 is a certain indication that the most
recent version of the record still resides in the main data file; the differential
file search is skipped and the main file is immediately accessed. A resulting
value of 1 is a probable indication that an updated copy of the record will be
found in the differential file and it is therefore searched. There is a
possibility that this search may prove fruitless, since the bits associated with a
given identifier might coincidentally be set to 1 by mappings from other updated
records. Only in the event of such a filtering error are both files searched
during a record retrieval.
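
The storage and retrieval procedures above can be sketched as follows. This
is an illustrative sketch rather than the authors' implementation: the class
names are hypothetical, dictionaries stand in for the two files, and the X
bit addresses are derived from a SHA-256 digest by double hashing, which is
an assumption made here for brevity.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits, x_hashes):
        self.m = m_bits
        self.x = x_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # bit vector B, initially all zero

    def _addresses(self, record_id):
        # Assumption: X addresses are generated from one digest by double hashing.
        digest = hashlib.sha256(str(record_id).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.x)]

    def add(self, record_id):
        for a in self._addresses(record_id):
            self.bits[a // 8] |= 1 << (a % 8)

    def might_contain(self, record_id):
        # Logical AND of the X addressed bits.
        return all(self.bits[a // 8] & (1 << (a % 8))
                   for a in self._addresses(record_id))


class FilteredDatabase:
    def __init__(self, main_records, m_bits=25_000, x_hashes=3):
        self.main = dict(main_records)
        self.differential = {}
        self.filter = BloomFilter(m_bits, x_hashes)
        self.double_accesses = 0  # count of filtering errors observed

    def update(self, record_id, record):
        self.differential[record_id] = record
        self.filter.add(record_id)

    def retrieve(self, record_id):
        if self.filter.might_contain(record_id):
            if record_id in self.differential:
                return self.differential[record_id]
            self.double_accesses += 1  # filtering error: both files are searched
        return self.main.get(record_id)
```

A result of 0 from the filter never hides an updated record, because every
stored identifier sets all of its X bits; only unchanged records can cause
the extra search counted in `double_accesses`.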

It is reasonable to assume that the cumulative computation time of the hashing
functions is insignificant in comparison to the time required for an access to
secondary memory. Either division [3] or quadratic hashing functions [11], for
example, can generate several addresses quickly.
2.5 Double-access frequency. The probability of a filtering error at a
given point in time is a function of both the proportion of main file records
which have been modified and the proportion of bits in B which are set to 1.
Expected values for all of these quantities can be calculated. Consider a
database of N records. Assume updates are independent, uniformly distributed over
all records, and arrive over time at a fixed rate r. Similarly assume the
existence of a collection of X hashing functions whose mappings are independent and
uniform over B. Define T to be the length of time between reorganizations of
the main data file. The expected proportion of distinct main file records, R_t,
which are updated during a time period of length t is given by

    \[ R_t = 1 - e^{-rt/N}. \]

For various values of record update intensity, rT/N, Figure 2 depicts the growth
over time of both R_t and rt/N. Respectively, these curves characterize the size
of a differential file which contains (a) only the most recent image of a changed
record, or (b) an accumulation of all such record images.

Now consider a random bit in B. The probability that it has value 1 at time t
is

    \[ 1 - e^{-rtX/M}, \]

and therefore, given an identifier of an unchanged record, the probability of a
filtering error is

    \[ \left( 1 - e^{-rtX/M} \right)^{X}. \]

The unconditional probability of a filtering error at time t is thus given by

    \[ P_t = (1 - R_t) \left( 1 - e^{-rtX/M} \right)^{X}. \]
2.6 Designing a Bloom filter. The frequency with which filtering errors
occur can be controlled by manipulating the values selected for M and X. A
common design problem is likely to be the following. Some quantity of space M'
is available in main memory and can reasonably be allocated as a bit vector. It
is necessary to determine the number of transformations which should map into
this fixed area. If the value selected for X is too small, the system will
underutilize the available bits. If, on the other hand, X is too large then most bits
will be set to 1 and the filter will be ineffective. Figure 3 illustrates P_t as
a function of time for various values of X. N and M' are arbitrarily set to rT.
Each curve shows an initial increase in filtering errors, which eventually
decline as the proportion of unchanged records in the main file diminishes. As
the value of X increases, superior initial performance deteriorates at a faster
rate to inferior final performance.
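
The behavior described for Figure 3 can be tabulated numerically. The sketch
below assumes the exponential forms R_t = 1 - e^{-rt/N} for the changed
fraction and 1 - e^{-rtX/M} for the probability that a bit is set, with
N = M' = rT as in the figure; the function names and the sample values of r
and T are illustrative choices, not taken from the paper.

```python
import math


def changed_fraction(t, r, n):
    """Expected proportion R_t of distinct records updated by time t."""
    return 1.0 - math.exp(-r * t / n)


def bit_set_probability(t, r, m, x):
    """Probability that a random bit of B has value 1 at time t."""
    return 1.0 - math.exp(-r * t * x / m)


def filtering_error(t, r, n, m, x):
    """Unconditional probability P_t of a fruitless differential file search."""
    unchanged = 1.0 - changed_fraction(t, r, n)
    return unchanged * bit_set_probability(t, r, m, x) ** x


if __name__ == "__main__":
    r, T = 1000.0, 5.0          # illustrative update rate and reorganization period
    n = m = r * T               # N and M' set to rT, as for Figure 3
    for x in (1, 2, 3, 4):
        row = [filtering_error(f * T, r, n, m, x) for f in (0.1, 0.25, 0.5, 0.75, 1.0)]
        print(x, " ".join("%.3f" % p for p in row))
```

Under these assumptions a larger X gives the lower error rate early in the
period and the higher one at time T, which is the crossover the text
describes.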

In general, for a fixed M', a number of reasonable objective functions might
be used to select a value for X. Among these are:

(a) minimize P_{t'} at a given time t',

(b) minimize \int_0^T P_t \, dt,

(c) minimize (maximum P_t) over the time interval [0,T],

(d) minimize \int_0^T P_t \, dt, subject to P_t \le p'.
Intuitively, one can see from Figure 3 that each of these objectives might lead
to the selection of a different value for X. While the formal analysis required
by problems (b), (c), and (d) is beyond the scope of the current paper, problem
(a) is easily solved using classical optimization techniques. Setting the
derivative of P_{t'} with respect to X equal to zero yields the equality

    \[ \hat{X} = \frac{M' \ln 2}{r t'}. \]

Since d^2 P_{t'} / dX^2 is positive at \hat{X}, this value of X minimizes P_{t'}
for given values of M', r, and t'.

For X equal to \hat{X}, one can easily verify that the expected number of bits in
B set to 1 at time t' is M'/2 (in agreement with a fundamental theorem of
information theory and a similar result obtained by Bloom), and that the probability of
a filtering error at this time is

    \[ P_{t'} = (1 - R_{t'}) \, 2^{-\hat{X}}. \]

In practice X must be integral and the two integers nearest \hat{X} should be
checked for minimal P_{t'}. It can be shown that one of them must be the optimal
value of X.

Applying these results to a specific problem with N = 10^7, r = 10^3, T = 5,
M' = 2.5 × 10^4, and t' = T, one calculates \hat{X} = 3.47 and P_T = .0905. For
X = 3 and 4, P_T takes on values .0918 and .0919 respectively. Thus three
transformations used in conjunction with the given 3125 byte Bloom-vector are
expected to produce fewer than one-in-ten filtering errors at the time of
reorganization. (The average error rate over time is approximately one-in-thirty.)
A differential file, therefore, will not appreciably affect average retrieval
time for this database.
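
The figures in this example can be checked with a short calculation. The
sketch assumes the exponential error model
P_t = e^{-rt/N} (1 - e^{-rtX/M'})^X and an update rate of r = 10^3, the value
consistent with the quoted optimum of 3.47; the function names are
illustrative.

```python
import math


def filtering_error(t, r, n, m, x):
    """P_t under the exponential approximation assumed in the lead-in."""
    return math.exp(-r * t / n) * (1.0 - math.exp(-r * t * x / m)) ** x


def optimal_transformations(n, m, r, t):
    """Continuous minimizer M' ln 2 / (r t') and the better neighboring integer."""
    x_hat = m * math.log(2) / (r * t)
    candidates = {max(1, math.floor(x_hat)), math.ceil(x_hat)}
    best = min(candidates, key=lambda x: filtering_error(t, r, n, m, x))
    return x_hat, best


if __name__ == "__main__":
    N, R, T, M = 1e7, 1e3, 5.0, 2.5e4
    x_hat, best = optimal_transformations(N, M, R, T)
    print("continuous optimum:", round(x_hat, 2), "best integer:", best)
    for x in (x_hat, 3, 4):
        print(round(x, 2), round(filtering_error(T, R, N, M, x), 4))
```

Checking both neighboring integers matters here because the error rates for
X = 3 and X = 4 differ only in the fourth decimal place.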
3. ADVANTAGES OF A DIFFERENTIAL FILE

The history of database development efforts shows design simplicity to be
a dominant characteristic of successful implementations. And although the notion
of a differential file is conceptually rather simple, in practice, any additional
system complexity must be justified by tangible benefits. The potential advantages
of a differentially organized database are not widely appreciated; even existing
systems are rather narrowly motivated. This section therefore collects and
discusses ten general benefits that can be realized. Six relate to database
integrity and show that a differential file can reduce backup costs, speed recovery,
and even minimize the chances of a serious data loss. The final four advantages
are operational; a differential file can provide increased data availability and
simultaneously reduce storage and retrieval costs. In total these benefits
constitute a strong argument for a much wider use of differential files, especially
for the maintenance of very large databases.

3.1 Reduces database dumping costs. In general, to recover a database which
has been physically damaged, some form of roll-forward procedure (Yourdon [17]) is
employed: the status of the database, saved at a previous point in time, is first
reloaded; the cumulative effect of all update transactions processed since that
time is then re-established via some abbreviated form of reprocessing. The
frequency with which the database is copied to its backup file is a critical
parameter in designing such a procedure (Chandy et al. [4], Drake and Smith [6]).
Frequent dumping permits fast recovery, but is associated with a high system
overhead.

Since the time required for a dump is proportional to the volume of data
copied [8], a differential file can drastically reduce the cost to backup a
large database, particularly when the proportion of records changed during a
backup period is small. Consider, for example, a database with 10^7 500-character
records stored in track size blocks on an IBM 3330 disk facility. Suppose
updates are applied five days per week, 10 hours per day at a rate of 100 changes
per hour. Using a fast dump/restore utility [8] a full database dump would
require over six hours to accomplish. On the other hand, even after a full week of
processing, a differential file, its bit vector, and a reasonable search mechanism
could be dumped in less than two minutes. In total, they would occupy less than
300 tracks of storage as compared to 51 disk packs.
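
The storage figures in this example can be reproduced with simple
arithmetic. The disk geometry used below (13,030 bytes per track, 19 tracks
per cylinder, 404 cylinders per pack) is an assumption about the IBM 3330
that is not stated in the text, and the constant and function names are
illustrative.

```python
import math

# Assumed IBM 3330 geometry; not given in the paper.
TRACK_BYTES = 13_030
TRACKS_PER_PACK = 19 * 404
RECORD_BYTES = 500


def tracks_needed(records):
    """Tracks required when whole records are packed into track-size blocks."""
    per_track = TRACK_BYTES // RECORD_BYTES   # 26 records fit on one track
    return math.ceil(records / per_track)


main_tracks = tracks_needed(10**7)
main_packs = math.ceil(main_tracks / TRACKS_PER_PACK)

weekly_updates = 5 * 10 * 100                 # days x hours per day x changes per hour
diff_tracks = tracks_needed(weekly_updates)   # worst case: every update hits a new record

if __name__ == "__main__":
    print("main file:", main_tracks, "tracks on", main_packs, "packs")
    print("differential file after one week:", diff_tracks, "tracks")
```

Under these assumptions the main file needs 51 packs while a week of changes
fits in 193 tracks, which leaves room within the quoted 300 tracks for the
bit vector and a search index.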

3.2 Facilitates incremental dumping. It is sometimes impractical to dump
an entire database at one time. An incremental dumping strategy (Sayani [14],
also called "differential disk dumping" by Yourdon [17]) will sequence through
physical sections of a database, periodically dumping each section which has
changed. A differential file implementation in which new records are sequentially
allocated in secondary memory (for example, Rappaport's system [12]) lends itself
naturally to such a strategy. To provide a complete database backup at any point
in time, one need only append to the current backup file those differential file
records created since the last dump. With each incremental dump, one might also
choose to save the current status of the differential file bit vector and search
index. Alternatively, both could be recovered with a single scan of the restored
differential file.
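
A minimal sketch of incremental dumping for a sequentially allocated
differential file follows. The names are hypothetical, a Python list stands
in for the backup file, and only the search index is rebuilt on restore; a
bit vector could be rebuilt in the same scan.

```python
class AppendOnlyDifferentialFile:
    def __init__(self):
        self.records = []      # (record_id, image) pairs in order of creation
        self.dumped_upto = 0   # number of records already on the backup file

    def append(self, record_id, image):
        self.records.append((record_id, image))

    def incremental_dump(self, backup):
        # Append to the backup only those records created since the last dump.
        new = self.records[self.dumped_upto:]
        backup.extend(new)
        self.dumped_upto = len(self.records)
        return len(new)


def restore(backup):
    """Rebuild the file and its search index with a single scan of the backup."""
    restored = AppendOnlyDifferentialFile()
    index = {}
    for position, (record_id, image) in enumerate(backup):
        restored.append(record_id, image)
        index[record_id] = position   # a later image supersedes an earlier one
    restored.dumped_upto = len(backup)
    return restored, index
```

Because records are only ever appended, the position reached by the previous
dump is the only state needed to find what has changed since.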

3.3 Permits both realtime dumping and reorganization with concurrent updates.
Since a dump represents the instantaneous status of a database at a fixed point
in time, conventional backup procedures will prohibit all changes while this
snapshot is developed. By dumping only a small differential file, the time during
which update transactions are prohibited can be substantially reduced. More
importantly, one can avoid completely the need to inhibit change by building a
"differential-differential" file to store record updates which are generated
during the differential file dumping process. For most applications, this file
will be quite small and reasonably held in main memory. Acting as a "cache"
store during the dump, it would be scanned before every retrieval. When the
dump is complete, its records would be incorporated into the main differential
file. Clearly, the same basic idea will permit online reorganization. Since
the generation of a new main file might require a significant amount of time,
the differential-differential file would be maintained in secondary memory. It
replaces the old differential file when reorganization is complete.
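
The dumping procedure with concurrent updates might be sketched as below.
This is an illustration only: the class and method names are hypothetical,
dictionaries stand in for the files, and copying the differential file
stands in for writing it to a backup device.

```python
class DumpableDatabase:
    def __init__(self, main_records):
        self.main = dict(main_records)
        self.differential = {}
        self.cache = None   # differential-differential file; exists only during a dump

    def update(self, record_id, image):
        # While a dump is in progress, updates are diverted to the cache.
        target = self.differential if self.cache is None else self.cache
        target[record_id] = image

    def retrieve(self, record_id):
        # The newest layer is consulted first.
        for layer in (self.cache, self.differential, self.main):
            if layer is not None and record_id in layer:
                return layer[record_id]
        return None

    def begin_dump(self):
        self.cache = {}
        # The differential file is now frozen and can be copied at leisure.
        return dict(self.differential)

    def end_dump(self):
        # Incorporate the cached updates into the main differential file.
        self.differential.update(self.cache)
        self.cache = None
```

Online reorganization would follow the same pattern, except that the cache
would replace the old differential file rather than being merged into it.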

These procedures for dumping and reorganizing a database are particularly
appropriate for applications such as airline reservation systems, which require
24-hour, online availability, but which experience periods of reduced traffic
intensity. Without locking out updates, either procedure could be activated
during a slack period and would act to level the system load.

3.4 Speeds recovery from a "soft" data loss. Damage to storage hardware
is not the only cause of data loss. A user program may incorrectly modify a
database, or a program error, a system deadlock, or a machine failure may abort
the processing of an update transaction in the midst of a multi-step procedure
(such as a transfer of funds between bank accounts). The content and/or
structural integrity of the database may be damaged by either type of error.
Rappaport's Vehicle and Driver Information System [12] provides a working example of
a differential file (maintained in the form of an online after-image log) which
permits rapid system recovery through the rollback of incorrectly processed or
partially completed transactions.
3.5 Speeds recovery from a hard data loss. As described above, the use of
a differential file can dramatically reduce the cost of dumping a large database.
An inexpensive dump procedure can be invoked frequently, which will in turn
reduce the average number of changes to be reapplied in the event of a database loss.

3.6 Reduces the risk of a serious data loss. When recovering a database,
a properly tuned dump-restore utility can reload a physical dump at nearly the
maximum transfer rate of available hardware (on the order of 10^5 characters/second).
The major portion of recovery time is then spent individually reapplying updates
to a small fraction of the restored records. This small subset of changed records
constitutes an "Achilles' heel." Traditional update-in-place file organizations
distribute changed records widely over secondary memory; this practice guarantees
that even localized physical damage (e.g., a track loss or head crash on a single
device of a very large database) will require a lengthy recovery procedure. By
concentrating updates in a small physical area, a differential file offers three
potential advantages:

(a) The critical exposure area of a database is minimized.
Most physical damage can be quickly repaired with a
localized backup copy procedure.

(b) The critical area may be allocated to a more reliable
device type than is practical for the larger main file.

(c) The small critical area may be duplexed to provide the
most valuable redundancy for a marginal increase in
operating costs.

3.7 Supports "memo files" efficiently. Accurate online updating of a database
requires complex software to provide multiuser access control and to assure data
recoverability (King and Collmeyer [9]). To avoid the substantial overhead
associated with such software, many "online" systems will actually batch updates for
end-of-day processing. Inventory control systems, for example, can generally
tolerate some loss of accuracy during the batching cycle provided data integrity
is re-established with each batch run. In systems where a predictable
information lag might be exploited (e.g., banking or stock quotation) the memo file
concept of Davis [5] can be used to maintain "probably-accurate" data without
the need for complex software. The idea is to permit software which does not
defend against improbable events (such as concurrent update, system failure, head
crash) to update a "scratch pad" copy of the database. At end-of-day the copy
is discarded and the updates are reapplied to the "real" database. The use for
a differential file here is obvious.

3.8 Simplifies software development. Since the main data file and its
associated index are unaffected by updates in a differential file system, this
affords a natural environment for the development and testing of new data
processing software. Using two differential files, one can imagine a developmental
system and a production system running in parallel with both accessing the same
main file but modifying their own differential files. To debug new software,
online comparisons could be made between the data values maintained by both systems.
For very large databases, where it is either (1) infeasible to create a duplicate
copy of the database for experimentation, or (2) at least impossible for
both copies to be online simultaneously, this use of differential files is
particularly important.

3.9 Simplifies main file software. Because the main file is static between
reorganizations, the structures required for its storage are inherently simple
and efficient. Neither free space nor record linkages are allocated to accommodate
record growth, and a greater density of data storage can be achieved when the
database is initially loaded [13]. Since the main file is read-only, multiple
access requests may be handled concurrently without requiring the use of a
complex protocol to avoid deadlock or errors due to simultaneous write access
(Brinch Hansen [2]). Thus if a user program requires access to data which is
either (1) known to be constant or (2) relatively stable and absolute currency
is non-critical, then such requests may bypass the differential file and safely
access only the main file without queueing for private access.

3.10 Reduces future database storage costs. Within the next decade,
trillion bit random access mass storage devices will be provided by at least one of
several competing technologies. The cost for a dynamic read-write capability is
expected to be an order of magnitude higher than the cost of a read-only memory.
The application of the differential file concept in such an operating environment
is obvious: the large main data file is read-only. The cost reduction which
differential files provide will greatly enlarge the realm of feasible
computer-based information systems.

4. SUMMARY AND FURTHER RESEARCH

A differential file is an efficient representation of database updates. By
consolidating modifications, and physically isolating them from the main,
read-only data file, it is possible to reduce backup costs, speed database recovery,
avoid serious data losses, increase data availability, decrease storage costs and
speed retrieval operations. The paper provides a number of arguments which should
motivate a wider use of differential files, particularly for the maintenance of
very large databases.

* Based upon presentations made in May 1975 by the Panel on Future Architecture
at the Very Large Data Base Conference, Framingham, MA.
The design of a differential file system involves tradeoffs among the costs
of update, retrieval, storage, backup, and recovery. Implementation questions
currently under investigation by the authors include:

(1) What data should be stored in a differential file? Should it
contain complete data records, as suggested here, or only data fields which have
changed? Should old versions of a differential file record be overwritten or
retained as a before-image?

(2) How frequently should differential file backup and reorganization occur?

(3) How do differential file size and filter error probability grow with
the non-random arrival of non-uniform updates? How should the filter parameters
M and X be selected?

(4) Given a hierarchy of storage devices, at what levels should the main
file, differential file, search mechanisms, and bit vector be stored?

(5) How can differential files be used to facilitate maintenance of
distributed databases?

Formal statements of these problems are extremely complex and their solution is
difficult. When solved, however, the results may have significant practical value.